Abstract

The common internal structure and algorithmic 
organization of object detection, detection-based 
tracking, and event recognition facilitates a gen- 
eral approach to integrating these three compo- 
nents. This supports multidirectional informa- 
tion flow between these components allowing 
object detection to influence tracking and event 
recognition and event recognition to influence 
tracking and object detection. The performance 
of the combination can exceed the performance 
of the components in isolation. This can be done 
with linear asymptotic complexity. 



1 Introduction 



Many common approaches to event recognition (Siskind and
Morris, 1996; Starner et al., 1998; Wang et al., 2009;
Xu et al., 2002, 2005) classify events based on their motion
profile. This requires detecting and tracking the event
participants. Adaptive approaches to tracking (Yilmaz et al.,
2006), e.g. Kalman filtering (Comaniciu et al., 2003), suffer
from three difficulties that impact their utility for event
recognition. First, they must be initialized. One cannot 
initialize on the basis of motion since many event partici- 
pants move only for a portion of the event, and sometimes 
not at all. Second, they exhibit drift and often must be 
periodically reinitialized to compensate. Third, they have
difficulty tracking small, deformable, or partially occluded 
objects as well as ones whose appearance changes dramati- 
cally. This is particularly of concern since many events, e.g. 
picking things up, involve humans interacting with objects 
that are sufficiently small for humans to grasp and where 
such interaction causes appearance change by out-of-plane 
rotation, occlusion, or deformation. 

Detection-based tracking is an alternate approach that at- 
tempts to address these issues. In detection-based track- 
ing an object detector is applied to each frame of a video 
to yield a set of candidate detections which are composed 
into tracks by selecting a single candidate detection from 
each frame that maximizes temporal coherency of the track. 
However, current object detectors are far from perfect. On
the PASCAL VOC Challenge, they typically achieve average
precision scores of 40% to 50% (Everingham et al., 2010).
Directly applying such detectors on a per-frame basis would
be ill-suited to event recognition. Since the failure modes
include both false positives and false negatives,
interpolation does not suffice to address this shortcoming.
A better approach is to combine object detection and track- 
ing with a single objective function that maximizes tempo- 
ral coherency to allow object detection to inform the tracker 
and vice versa. 

One can carry this approach even further and integrate 
event recognition with both object detection and tracking. 
One way to do this is to incorporate coherence with a target 
event model into the temporal coherency measure. For ex- 
ample, a top-down expectation of observing a pick up event 
can bias the object detector and tracker to search for event 
participants that exhibit the particular joint motion profile
of that event: an object in close proximity to the agent, the 
object starting out at rest while the agent approaches the 
object, then the agent touching the object, followed by the 
object moving with the agent. Such information can also 
flow bidirectionally. Mutual detection of a baseball bat and 
a hitting event can be easier than detecting each in isolation 
or having a fixed direction of information flow. 

The common internal structure and algorithmic organization
of current object detectors (Felzenszwalb et al., 2010a,b),
detection-based trackers (Wolf et al., 1989), and HMM-based
approaches to event recognition (Baum and Petrie, 1966)
facilitates a general approach to integrating these three
components. We demonstrate an approach to integrating object
detection, tracking, and event recognition and show how it
improves each of these three components in isolation.
Further, while prior detection-based trackers exhibit
quadratic complexity, we show how such integration can be
fast, with linear asymptotic complexity.

2 Detection-based tracking 

The methods described in Sections 4, 5, and 6 extend a
popular dynamic-programming approach to detection-based
tracking. We review that approach here to set forth the 
concepts, terminology, and notation that will be needed to 
describe the extensions. 

Detection-based tracking is a general framework where an 
object detector is applied to each frame of a video to yield a 
set of candidate detections which are composed into tracks 
by selecting a single candidate detection from each frame 
that maximizes temporal coherency of the track. This gen- 
eral framework can be instantiated with answers to the fol- 
lowing questions: 

1. What is the representation of a detection?

2. What is the detection source?

3. What is the measure of temporal coherency? 

4. What is the procedure for finding the track with max- 
imal temporal coherency? 

We answer questions 1 and 2 by taking a detection to be
a scored axis-aligned rectangle (box), such as produced by
the Felzenszwalb et al. (2010a,b) object detectors, though
our approach is compatible with any method for producing
scored axis-aligned rectangular detections. If $b^t_j$ denotes
the $j$th detection in frame $t$, $f(b^t_j)$ denotes the score
of that detection, $T$ denotes the number of frames, and
$j = (j_1, \ldots, j_T)$ denotes a track comprising the $j_t$th
detection in frame $t$, we answer question 3 by formulating
temporal coherency of a track $j = (j_1, \ldots, j_T)$ as:



$$\max_{j_1, \ldots, j_T} \sum_{t=1}^{T} f(b^t_{j_t}) + \sum_{t=2}^{T} g(b^{t-1}_{j_{t-1}}, b^t_{j_t}) \tag{1}$$

where $g$ scores the local temporal coherency between
detections in adjacent frames. We take $g$ to be the negative
Euclidean distance between the center of $b^t_{j_t}$ and the
center of $b^{t-1}_{j_{t-1}}$ projected forward one frame,
though, as discussed below, our approach is compatible with a
variety of functions discussed by Felzenszwalb and
Huttenlocher (2004). The forward projection internal to $g$
can be done in a variety of ways including optical flow and
the Kanade-Lucas-Tomasi (KLT) (Shi and Tomasi, 1994; Tomasi
and Kanade, 1991) feature tracker. We answer question 4 by
observing that Eq. 1 can be optimized in polynomial time with
the Viterbi algorithm (Viterbi, 1971):

$$\begin{array}{l}
\textbf{for } j = 1 \textbf{ to } J_1 \textbf{ do } \delta^1_j := f(b^1_j) \\
\textbf{for } t = 2 \textbf{ to } T \\
\quad \textbf{do for } j = 1 \textbf{ to } J_t \\
\qquad \textbf{do } \delta^t_j := f(b^t_j) + \max\limits_{j'=1}^{J_{t-1}} \left( g(b^{t-1}_{j'}, b^t_j) + \delta^{t-1}_{j'} \right)
\end{array} \tag{2}$$

where $J_t$ is the number of detections in frame $t$. This leads
to a lattice as shown in Fig. 1.

Figure 1: The tracking lattice constructed by the Viterbi
algorithm performing detection-based tracking.
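The recurrence in Eq. 2 can be illustrated with a short Python sketch. This is a minimal illustration and not the implementation used in our experiments: the function name `viterbi_track`, the list-of-arrays layout, and the use of raw box centers without any forward projection inside $g$ are all assumptions made for brevity.

```python
import numpy as np

def viterbi_track(scores, centers):
    """Detection-based tracking by dynamic programming (sketch of Eq. 2).

    scores[t]  : sequence of J_t detection scores f(b^t_j)
    centers[t] : sequence of J_t (x, y) box centers
    Returns (best_total_score, track), where track[t] is the chosen index j_t.
    Assumption: g is the negative Euclidean distance between box centers,
    with no forward projection of the previous-frame box.
    """
    delta = [np.asarray(scores[0], dtype=float)]
    back = []
    for t in range(1, len(scores)):
        prev = np.asarray(centers[t - 1], dtype=float)
        cur = np.asarray(centers[t], dtype=float)
        # g[j', j]: coherency between detection j' in frame t-1 and j in frame t
        g = -np.linalg.norm(prev[:, None, :] - cur[None, :, :], axis=2)
        cand = g + delta[-1][:, None]                 # shape (J_{t-1}, J_t)
        back.append(cand.argmax(axis=0))              # best predecessor per j
        delta.append(np.asarray(scores[t], dtype=float) + cand.max(axis=0))
    j = int(delta[-1].argmax())
    best = float(delta[-1][j])
    track = [j]
    for bp in reversed(back):                         # follow back-pointers
        j = int(bp[j])
        track.append(j)
    track.reverse()
    return best, track
```

In a three-frame example where the middle frame contains a high-scoring but distant detection, the sketch selects the weaker, spatially coherent detection, which is the behavior attributed to $g$ in the text.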

Detection-based trackers exhibit less drift than adaptive
approaches to tracking due to fixed target models. They also
tend to perform better than simply picking the best detection
in each frame. The reason is that one can allow the
detection source to produce multiple candidates and use
the combination of the detection score $f$ and the
adjacent-frame temporal-coherency score $g$ to select the
track. The essential attribute of detection-based tracking is
that $g$ can overpower $f$ to assemble a more coherent track
out of weaker detections. The nonlocal nature of Eq. 1 can
allow more-reliable tracking with less-reliable detection
sources.

A crucial practical issue arises: How many candidate
detections should be produced in each frame? Producing too
few may risk failing to produce the desired detection that
is necessary to yield a coherent track. In the limit, it is
impossible to construct any track if even a single frame lacks
any detections.¹ The current state of the art in object
detection is unable to simultaneously achieve high precision
and recall and thus it is necessary to explore the trade-off
between the two (Everingham et al., 2010). A detection-based
tracker can bias the detection source to yield higher
recall at the expense of lower precision and rely on temporal
coherency to compensate for the resulting lower precision.
This can be done in at least three ways. First, one can
depress the detection-source acceptance thresholds. One
way this can be done with the Felzenszwalb et al. detectors
is to lower the trained model thresholds. Second, one can
pool the detections output by multiple detection sources
with complementary failure modes. One way this can be
done is by training multiple models for people in different
poses. Third, one can use adaptive-tracking methods
to project detections forward to augment the raw detector
output and compensate for detection failure in subsequent
frames. This can be done in a variety of ways including
optical flow and KLT. The essence of our paper is a more
principled collection of approaches for compensating for
low recall in the object detector.

¹ One can ameliorate this somewhat by constructing a lattice
that skips frames (Sala et al., 2010). This increases the
asymptotic complexity to be exponential in the number of
frame skips allowed.

A practical issue arises when pooling the detections out- 
put by multiple detection sources. It is necessary to nor- 
malize the detection scores for such pooled detections by a 
per-model offset. One can derive an offset by computing a 
histogram of scores of the top detection in each frame of a 
video and taking the offset to be the minimum of the value 
that maximizes the between-class variance (Otsu, 1979)
when bipartitioning this histogram and the trained accep- 
tance threshold offset by a small but fixed amount. 
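The offset computation described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the helper names `otsu_threshold` and `per_model_offset`, the number of histogram bins, and the sign and size of the fixed `shift` applied to the trained threshold are hypothetical, since the text specifies only that the shift is small and fixed.

```python
import numpy as np

def otsu_threshold(values, bins=64):
    """Value that maximizes between-class variance when bipartitioning a histogram."""
    hist, edges = np.histogram(np.asarray(values, dtype=float), bins=bins)
    mids = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(hist)                        # count in the lower class
    w1 = hist.sum() - w0                        # count in the upper class
    s0 = np.cumsum(hist * mids)                 # weighted sum of the lower class
    s1 = s0[-1] - s0                            # weighted sum of the upper class
    mu0 = np.divide(s0, w0, out=np.zeros_like(s0), where=w0 > 0)
    mu1 = np.divide(s1, w1, out=np.zeros_like(s1), where=w1 > 0)
    valid = (w0 > 0) & (w1 > 0)
    between = np.where(valid, w0 * w1 * (mu0 - mu1) ** 2, -np.inf)
    k = int(between.argmax())
    return float(edges[k + 1])                  # boundary between bins k and k+1

def per_model_offset(top_scores, trained_threshold, shift):
    """Minimum of the Otsu split of the per-frame top scores and the shifted
    trained acceptance threshold. `shift` is a hypothetical fixed amount."""
    return min(otsu_threshold(top_scores), trained_threshold + shift)
```

Here `top_scores` would hold the score of the top detection in each frame of a video for one model, and the returned offset would be subtracted from that model's scores before pooling.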

The operation of a detection-based tracker is illustrated 
in Fig. 2. This example demonstrates several things of
note. First, reliable tracks are produced despite an unre- 
liable detection source. Second, the optimal track contains 
detections with suboptimal score. Row (b) demonstrates 
that selecting the top-scoring detection does not yield a 
temporally-coherent track. Third, forward-projection of 
detections from the second to third column in row (c) com- 
pensates for the lack of raw detections in the third column 
of row (a). 

Detection-based tracking runs in time $O(TJ^2)$ on videos
of length T with J detections per frame. In practice, the 
run time is dominated by the detection process and the 
dynamic-programming step. Limiting J to a small num- 
ber speeds up the tracker considerably while minimally im- 
pacting track quality. We further improve the speed of the 
detectors when running many object classes by factoring 
the computation of the HOG pyramid. 

3 Evaluation of detection-based tracking 

We evaluated detection-based tracking using the year-one
(Y1) corpus produced by DARPA for the Mind's Eye program.
These videos are provided at 720p@30fps and range
from 42 to 1727 frames in length, with an average of 438.84
frames, and depict people interacting with a variety of
objects to enact common English verbs.



Four Mind's Eye teams (University at Buffalo, Corso, 2011;
Stanford Research Institute, Bui, 2011; University of
California at Berkeley, Saenko, 2011; and University of
Southern California, Navatia, 2011) independently produced
human-annotated tracks for different portions of Y1.
We used these sources of human-annotated tracks to evaluate
the performance of detection-based tracking by computing
human-human intercoder agreement between all pairs of the
four sources of human-annotated tracks and human-machine
intercoder agreement between a detection-based tracker and
all four of these sources. Since each team annotated
different portions of Y1, each such intercoder agreement
measure was computed only over the $N$ videos shared by each
pair, as reported in Table 1(a). One team (University at
Buffalo, Corso, 2011) annotated detections as clusters of
quadrilaterals around object parts. These were converted to a
single bounding box.

Different teams labeled the tracks with different class
labels. It was possible to determine from these labels whether
the track was for a person or nonperson by assuming that
the labels 'person' and 'human', and only those labels,
denoted person tracks, but it was not possible to
automatically make finer-grained class comparisons. Thus we
independently compared person tracks with person tracks and
nonperson tracks with nonperson tracks. When comparing
an annotation $u$ of a video $n$ containing $l^u_n$ person
tracks with an annotation $v$ of that same video containing
$l^v_n$ person tracks, we compared $l_n = \min(l^u_n, l^v_n)$
person tracks. We selected the best over all permutation
mappings $\rho_n$ between person tracks in $u$ and person
tracks in $v$. A permutation mapping was preferred when it had
higher average overlap score among corresponding boxes across
the tracks and the frames in a video, where the overlap score
was that used by the PASCAL VOC Challenge (Everingham et al.,
2010), namely the ratio of the area of their intersection to
the area of their union. Different tracks could annotate
different frames of a video. When comparing such, we
considered only the shared frames.

Figure 2: The operation of a detection-based tracker. (a) Output of the detection sources, biased to yield false positives.
(b) The top-scoring output of the detection source. (c) Augmenting the output of the detection sources with
forward-projected detections. (d) The optimal tracks selected by the Viterbi algorithm.

For every pair of teams, we computed the mean and standard
deviation of the overlap score across all shared frames
in all tracks in the best permutation mappings for all shared
videos. The averaging process used both to determine
the best permutation mapping for each video pair and to
determine overall mean and standard deviation measures
weighted each overlap score equally. More precisely, if
$1 \le n \le N$, $1 \le l \le l_n$ denotes a shared track for
video $n$, $T^l_n$ denotes the set of shared frames for that
shared track $l$ in video $n$, $u^t_n$ and $v^t_n$ denote the
vector of boxes for frame $t$ in video $n$ for annotations $u$
and $v$ respectively, and $O$ denotes the overlap measure, we
score a permutation mapping $\rho_n$ for video $n$ as:

$$\frac{1}{\sum_{l=1}^{l_n} |T^l_n|} \sum_{l=1}^{l_n} \sum_{t \in T^l_n} O\left(\rho_n(u^t_n)[l],\, v^t_n[l]\right)$$

and computed the mean overlap for a pair of teams as:

$$\frac{1}{\sum_{n=1}^{N} \sum_{l=1}^{l_n} |T^l_n|} \sum_{n=1}^{N} \sum_{l=1}^{l_n} \sum_{t \in T^l_n} O\left(\rho_n(u^t_n)[l],\, v^t_n[l]\right)$$

with an analogous computation for standard deviation and
nonperson tracks.
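The per-video scoring of permutation mappings can be sketched in Python as follows. This is a minimal sketch, not the evaluation code used for Table 1: the names `overlap` and `best_mapping`, the `(x1, y1, x2, y2)` box layout, and the representation of a track as a dictionary from frame index to box are assumptions. The sketch enumerates mappings exhaustively, which is practical only for the small numbers of tracks per video considered here.

```python
from itertools import permutations

def overlap(a, b):
    """PASCAL VOC overlap: intersection area over union area.
    Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2 (assumed layout)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(iw, 0.0) * max(ih, 0.0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def best_mapping(tracks_u, tracks_v):
    """Best assignment of the tracks in tracks_u to distinct tracks in tracks_v.

    Each track is a dict: frame index -> box. Only frames present in both
    tracks are compared, and every overlap score is weighted equally.
    Assumes len(tracks_u) <= len(tracks_v); pass the smaller annotation first.
    Returns (mean_overlap, mapping) where mapping[l] indexes into tracks_v.
    """
    if len(tracks_u) > len(tracks_v):
        raise ValueError("pass the annotation with fewer tracks first")
    best_score, best_perm = -1.0, None
    for perm in permutations(range(len(tracks_v)), len(tracks_u)):
        total, count = 0.0, 0
        for l, m in enumerate(perm):
            for t in tracks_u[l].keys() & tracks_v[m].keys():   # shared frames
                total += overlap(tracks_u[l][t], tracks_v[m][t])
                count += 1
        score = total / count if count else 0.0
        if score > best_score:
            best_score, best_perm = score, perm
    return best_score, best_perm
```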

The overall mean and standard deviation measures, reported
in Table 1(b,c), indicate that the mean human-human overlap
exceeds the mean human-machine overlap only marginally, by
about one standard deviation. This suggests that improvement
in tracker performance is unlikely to lead to significant
improvement in action recognition performance and sentential
description quality.

4 Combining object detection and tracking 

While detection-based tracking is resilient to low precision,
it requires perfect recall; it cannot generate a track through
a frame that has no detections and it cannot generate a track
through a portion of the field of view which has no detections
regardless of how good the temporal coherence of the
resulting track would be. This brittleness means that any
detection source employed will have to significantly
overgenerate detections to achieve near-perfect recall. This
has a downside. While the Viterbi algorithm has linear
complexity in the number of frames, it is quadratic in the
number of detections per frame. This drastically limits the
number of detections that can reasonably be processed, leading
to the necessity of tuning the thresholds on the detection
sources. We have developed a novel mechanism to eliminate the
need for a threshold and track every possible detection, at
every position and scale in the image, in time linear in the
number of detections and frames. At the same time our approach
eliminates the need for forward projection since every
detection is already present. Our approach involves
simultaneously performing object detection and tracking,
optimizing the joint object-detection and temporal-coherency
score.

Table 1: (a) The number of videos in common, (b) the mean overlap, and (c) the standard deviation in overlap between
each pair of annotation sources.

Our general approach is to compute the distance between 
pairs of detection pyramids for adjacent frames, rather than 
using g to compute the distance between pairs of individual 
detections. These pyramids represent the set of all possi- 
ble detections at all locations and scales in the associated 
frame. Employing a distance transform makes this pro- 
cess linear in the number of location and scale positions 
in the pyramid. Many detectors, e.g. those of Felzenszwalb 
et al., use such a scale-space representation of frames to 
represent detections internally even though they might not 
output such. Our approach requires instrumenting such a 
detector to provide access to this internal representation. 

At a high level, the Felzenszwalb et al. detectors learn a
forest of HOG (Freeman and Roth, 1995) filters for each
object class along with their characteristic displacements.
Detection proceeds by applying each HOG filter at every
position in an image pyramid followed by computing the
optimal displacements at every position in that image
pyramid, thereby creating a new pyramid, the detection
pyramid. Finally, the detector searches the detection pyramid
for high-scoring detections and extracts those above a
threshold. The detector employs a dynamic-programming
algorithm to efficiently compute the optimal part
displacements for the entire image pyramid. This algorithm
(Felzenszwalb et al., 2010a) is very similar to the Viterbi
algorithm. It is made tractable by the use of a generalized
distance transform (Felzenszwalb and Huttenlocher, 2004) that
allows it to scale linearly with the number of image pyramid
positions. Given a set $\mathcal{G}$ of points (which in our
case denotes an image pyramid), a distance metric $d$ between
pairs of points $p$ and $q$, and an arbitrary function
$\phi : \mathcal{G} \to \mathbb{R}$, the generalized distance
transform $D_\phi(q)$ computes:

$$D_\phi(q) = \min_{p \in \mathcal{G}} \left( d(p, q) + \phi(p) \right)$$

in linear time for certain distance metrics including squared
Euclidean distance.
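For concreteness, the one-dimensional case with squared Euclidean distance can be sketched in Python using the lower-envelope-of-parabolas construction of Felzenszwalb and Huttenlocher (2004). This is an illustrative sketch: the function name `dt1d` is hypothetical, the grid points are assumed to be the integers $0, \ldots, n-1$, and all values of $\phi$ are assumed finite.

```python
import numpy as np

def dt1d(phi):
    """D(q) = min_p ((p - q)^2 + phi[p]) for every q on a 1-D grid, in O(n).

    Assumes at least one sample and finite values in phi.
    """
    phi = np.asarray(phi, dtype=float)
    n = len(phi)
    v = np.zeros(n, dtype=int)      # grid positions of parabolas in the lower envelope
    z = np.empty(n + 1)             # boundaries between consecutive envelope parabolas
    k = 0
    z[0], z[1] = -np.inf, np.inf
    for q in range(1, n):
        while True:
            p = v[k]
            # abscissa where the parabola rooted at q meets the one rooted at p
            s = ((phi[q] + q * q) - (phi[p] + p * p)) / (2.0 * q - 2.0 * p)
            if s <= z[k]:
                k -= 1              # parabola at v[k] is hidden; discard it
            else:
                break
        k += 1
        v[k] = q
        z[k] = s
        z[k + 1] = np.inf
    out = np.empty(n)
    k = 0
    for q in range(n):              # read the envelope off from left to right
        while z[k + 1] < q:
            k += 1
        p = v[k]
        out[q] = (q - p) ** 2 + phi[p]
    return out
```

Each grid position is pushed onto and popped from the envelope at most once, which is what yields linear rather than quadratic time.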



Instead of extracting and tracking just the thresholded
detections, one can directly track all detections in the
entire pyramid simultaneously by defining a distance measure
between detection pyramids for adjacent frames and performing
the Viterbi tracking algorithm on these pyramids instead of
sets of detections in each frame. To allow comparison between
detections at different scales in the detection pyramid, we
convert the detection pyramid to a rectangular prism by
scaling the coordinates of the detections at scale $s$ by
$\pi(s)$, chosen to map the detection coordinates back to
the coordinate system of the input frame. We define the
distance between two detections, $b$ and $b'$, in two
detection pyramids as a scaled squared Euclidean distance:

$$d(b_{xys}, b_{x'y's'}) = (\pi(s)x - \pi(s')x')^2 + (\pi(s)y - \pi(s')y')^2 + \alpha (s - s')^2 \tag{3}$$

where $x$ and $y$ denote the original image coordinates of a
detection center at scale $s$. Nominally, detections are boxes.
Comparing two such boxes involves a four-dimensional
distance metric. However, with a detection pyramid, the
aspect ratio of detections is fixed, reducing this to a
three-dimensional distance metric. The coefficient $\alpha$ in
the distance metric weights a difference in detection area
differently than detection position.

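The above amounts to replacing detections $b^t_j$ with
$b^t_{xys}$, lattice values $\delta^t_j$ with $\delta^t_{xys}$,
and Eq. 2 with:

$$\begin{array}{l}
\textbf{for } x = 1 \textbf{ to } X \\
\quad \textbf{do for } y = 1 \textbf{ to } Y \\
\qquad \textbf{do for } s = 1 \textbf{ to } S \textbf{ do } \delta^1_{xys} := f(b^1_{xys}) \\
\textbf{for } t = 2 \textbf{ to } T \\
\quad \textbf{do for } x = 1 \textbf{ to } X \\
\qquad \textbf{do for } y = 1 \textbf{ to } Y \\
\qquad\quad \textbf{do for } s = 1 \textbf{ to } S \\
\qquad\qquad \textbf{do } \delta^t_{xys} := f(b^t_{xys}) + \max\limits_{x'y's'} \left( g(b^{t-1}_{x'y's'}, b^t_{xys}) + \delta^{t-1}_{x'y's'} \right)
\end{array} \tag{4}$$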

The above formulation allows us to employ the generalized
distance transform as an analog to $g$ in Eq. 1, although it
restricts consideration of $g$ to be squared Euclidean distance
rather than Euclidean distance. We avail ourselves of the
fact that the generalized distance transform operates
independently on each of the three dimensions $x$, $y$, and $s$
in order to incorporate $\alpha$ into Eq. 3. While linear-time
use of the distance transform restricts the form of $g$, it
places no restrictions on the form of $f$.
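The per-frame update over a whole pyramid can be sketched as one pass of a distance transform along each dimension. The Python sketch below is illustrative only and rests on several assumptions: the names `min_conv_axis` and `pyramid_viterbi_step` are hypothetical, the pyramid is assumed to have already been resampled by $\pi(s)$ onto a uniform $X \times Y \times S$ grid, and $g$ is taken to be the negative of the scaled squared distance. For brevity the per-axis helper is the quadratic-time definition; a linear-time one-dimensional transform (Felzenszwalb and Huttenlocher, 2004) would replace it in practice without changing the result.

```python
import numpy as np

def min_conv_axis(cost, axis, weight=1.0):
    """out[..q..] = min_p (weight * (p - q)^2 + cost[..p..]) along one axis.

    Quadratic-time reference helper; stands in for a linear-time 1-D transform.
    """
    n = cost.shape[axis]
    idx = np.arange(n)
    pen = weight * (idx[:, None] - idx[None, :]) ** 2       # pen[p, q]
    c = np.moveaxis(cost, axis, -1)                          # (..., p)
    out = (c[..., :, None] + pen).min(axis=-2)               # (..., q)
    return np.moveaxis(out, -1, axis)

def pyramid_viterbi_step(delta_prev, f_cur, alpha):
    """One frame of the pyramid recurrence on a uniform (X, Y, S) grid.

    delta^t = f^t + max_{x'y's'} (delta^{t-1} - d), with
    d = (x - x')^2 + (y - y')^2 + alpha * (s - s')^2 (assumed form of -g).
    The maximization is a distance transform of -delta^{t-1}, applied
    independently along x, y, and s; alpha enters as the weight on s.
    """
    cost = -np.asarray(delta_prev, dtype=float)
    for axis, weight in ((0, 1.0), (1, 1.0), (2, alpha)):
        cost = min_conv_axis(cost, axis, weight)
    return np.asarray(f_cur, dtype=float) - cost
```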

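One way to view the above is that the vector of $\delta^t_j$
for all $1 \le j \le J_t$ from Eq. 2 is being represented as a
pyramid and the loop:

$$\begin{array}{l}
\textbf{for } j = 1 \textbf{ to } J_t \\
\quad \textbf{do } \delta^t_j := f(b^t_j) + \max\limits_{j'=1}^{J_{t-1}} \left( g(b^{t-1}_{j'}, b^t_j) + \delta^{t-1}_{j'} \right)
\end{array} \tag{5}$$

is being performed as a linear-time construction of a
generalized distance transform rather than a quadratic-time
nested pair of loops. Another way to view the above
is that we generalize the notion of a detection pyramid
from representing per-frame detections $b_{xys}$ at
three-dimensional pyramid positions $(x, y, s)$ to representing
per-video detections $b_{xyst}$ at four-dimensional pyramid
positions $(x, y, s, t)$ and finding a sequence of per-video
detections for $1 \le t \le T$ that optimizes the following
variant of Eq. 1:

$$\max_{\substack{x_1, \ldots, x_T \\ y_1, \ldots, y_T \\ s_1, \ldots, s_T}} \sum_{t=1}^{T} f(b^t_{x_t y_t s_t}) + \sum_{t=2}^{T} g(b^{t-1}_{x_{t-1} y_{t-1} s_{t-1}}, b^t_{x_t y_t s_t}) \tag{6}$$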

This combination of the detector and the tracker performs
simultaneous detection and tracking, integrating the
information between the two. Before, the tracker was affected
by the detector but the detector was unaffected by
the tracker: potential low-scoring but temporally-coherent
detections would not even be generated by the detector
despite the fact that they would yield good tracks. Now,
because the detector no longer chooses which detections to
produce but instead scores all detections at every position
and scale, the tracker is able to choose among any possible
detection. Such tight integration of higher- and lower-level
information will be revisited when integrating event models
into this framework.

5 Combining tracking and event detection 

It is popular to use Hidden Markov Models (HMMs) to perform
event recognition (Siskind and Morris, 1996; Starner et al.,
1998; Wang et al., 2009; Xu et al., 2002, 2005).
When doing so, the log likelihood of a video conditioned
on an event model is:

$$\log \sum_{k_1, \ldots, k_T} \exp \left( \sum_{t=1}^{T} h(k_t, b^t_{\hat{\jmath}_t}) + \sum_{t=2}^{T} a(k_t, k_{t-1}) \right)$$

where $k_t$ denotes the state of the HMM for frame $t$,
$h(k, b)$ denotes the log probability of generating a
detection $b$ conditioned on being in state $k$, $a(k, k')$
denotes the log probability of transitioning from state $k'$
to $k$, and $\hat{\jmath}_t$ denotes the index of the detection
produced by the tracker in frame $t$. This log likelihood can
be computed with the forward algorithm (Baum and Petrie,
1966), which is analogous to the Viterbi algorithm. Maximum
likelihood (ML), the standard approach to using HMMs for
classification, selects the event model that maximizes the
likelihood of an observed event. One can instead select the
model with the maximum a posteriori (log) probability (MAP):



$$\max_{k_1, \ldots, k_T} \sum_{t=1}^{T} h(k_t, b^t_{\hat{\jmath}_t}) + \sum_{t=2}^{T} a(k_t, k_{t-1}) \tag{7}$$



This can be computed with the Viterbi algorithm. The ad- 
vantage of doing so is that one can combine the Viterbi al- 
gorithm used for detection-based tracking with the Viterbi 
algorithm used for event classification. 
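The analogy between the two computations can be made concrete in log space, where the forward algorithm and the Viterbi algorithm differ only in replacing a log-sum-exp with a max. The following Python sketch is illustrative: the function name `hmm_scores` and the array layout are assumptions, the transition array is indexed as `a[k, k_prev]` to match the argument order $a(k_t, k_{t-1})$, and, as in the expressions above, no separate initial-state term is included.

```python
import numpy as np

def hmm_scores(h, a):
    """Log likelihood (forward algorithm) and MAP score (Viterbi) of one track.

    h[t, k]      : log probability of the tracked detection in frame t under state k
    a[k, k_prev] : log probability of entering state k from state k_prev
    Returns (log_likelihood, map_score).
    """
    alpha = h[0].copy()             # forward variables, summed over state sequences
    best = h[0].copy()              # Viterbi variables, maximized over state sequences
    for t in range(1, len(h)):
        alpha = h[t] + np.logaddexp.reduce(a + alpha[None, :], axis=1)
        best = h[t] + (a + best[None, :]).max(axis=1)
    return float(np.logaddexp.reduce(alpha)), float(best.max())
```

Because the two recurrences share the same structure, the max-based one can be merged with the tracker's lattice, which is the property exploited in the remainder of this section.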

One can combine Eq. 1 with Eq. 7 to yield a unified cost
function:

$$\max_{j_1, \ldots, j_T} \max_{k_1, \ldots, k_T} \sum_{t=1}^{T} f(b^t_{j_t}) + \sum_{t=2}^{T} g(b^{t-1}_{j_{t-1}}, b^t_{j_t}) + \sum_{t=1}^{T} h(k_t, b^t_{j_t}) + \sum_{t=2}^{T} a(k_t, k_{t-1}) \tag{8}$$

that computes the joint MAP of the best possible track and
the best possible state sequence by replacing $\hat{\jmath}_t$
with $j_t$ inside nested quantification. This too can be
computed with the Viterbi algorithm, taking the lattice values
to be indexed by the detection index $j$ and the state $k$,
forming the cross product of the tracker lattice nodes and the
event lattice nodes:



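$$\begin{array}{l}
\textbf{for } j = 1 \textbf{ to } J_1 \\
\quad \textbf{do for } k = 1 \textbf{ to } K \textbf{ do } \delta^1_{jk} := f(b^1_j) + h(k, b^1_j) \\
\textbf{for } t = 2 \textbf{ to } T \\
\quad \textbf{do for } j = 1 \textbf{ to } J_t \\
\qquad \textbf{do for } k = 1 \textbf{ to } K \\
\qquad\quad \textbf{do } \delta^t_{jk} := f(b^t_j) + h(k, b^t_j) + \max\limits_{j'=1}^{J_{t-1}} \max\limits_{k'=1}^{K} \left( g(b^{t-1}_{j'}, b^t_j) + a(k, k') + \delta^{t-1}_{j'k'} \right)
\end{array} \tag{9}$$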

This finds the optimal path through a graph where the nodes 
at every frame represent the cross product of the detections 
and the HMM states. 
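The cross-product lattice can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the implementation used in our experiments: the function name `joint_viterbi` and the data layout are hypothetical, the coherency scores $g$ are assumed to be precomputed per adjacent frame pair, and the transition array is indexed as `a[k, k_prev]`.

```python
import numpy as np

def joint_viterbi(f, g, h, a):
    """Jointly optimal track and HMM state sequence over a cross-product lattice.

    f[t] : (J_t,) detection scores
    g[t] : (J_{t-1}, J_t) coherency scores for t >= 1 (g[0] is unused)
    h[t] : (J_t, K) log probability of detection j under state k
    a    : (K, K) with a[k, k_prev] the log transition probability
    Returns (score, path) where path[t] = (j_t, k_t).
    """
    K = a.shape[0]
    delta = f[0][:, None] + h[0]                     # delta[j, k] for the first frame
    back = []
    for t in range(1, len(f)):
        # cand[j', k', j, k] = g[j', j] + a[k, k'] + delta_prev[j', k']
        cand = g[t][:, None, :, None] + a.T[None, :, None, :] + delta[:, :, None, None]
        flat = cand.reshape(-1, cand.shape[2], cand.shape[3])   # rows index (j', k')
        back.append(flat.argmax(axis=0))
        delta = f[t][:, None] + h[t] + flat.max(axis=0)
    j, k = np.unravel_index(int(delta.argmax()), delta.shape)
    score = float(delta[j, k])
    path = [(int(j), int(k))]
    for bp in reversed(back):                        # decode (j', k') from the flat index
        j, k = divmod(int(bp[j, k]), K)
        path.append((j, k))
    path.reverse()
    return score, path
```

The sketch forms the full four-index candidate array at each frame, which makes the quadratic cost in the number of detection-state pairs explicit.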

Doing so performs simultaneous tracking and event
classification. Before, the event classifier was affected by
the tracker but the tracker was unaffected by the event
classifier: potential low-scoring tracks would not even be
generated by the tracker despite the fact that they would
yield a high MAP estimate for some event class. Now, because
the tracker no longer chooses which tracks to produce but
instead scores all tracks, the event classifier is able to
choose among any possible track. This amounts to a different
kind of track-coherence measure that is tuned to specific
events. Such a measure might otherwise be difficult to achieve
without top-down information from the event classifier. For
example, applying this method to a video of a running person
along with an event model for running will be more likely to
compose a track out of person detections that has high
velocity and low change in direction.

Processing each frame $t$ with the algorithm in Eq. 9 is
quadratic in $J_t K$. This can be problematic since $J_t K$
can be large. As before, we can make this linear in $J_t$
using a generalized distance transform. One can make this
linear in $K$ for suitable state-transition functions $a$
(Felzenszwalb et al., 2003).

Two practical issues arise when applying the above method.
First, one can factor Eq. 10 as Eq. 11:

$$\max_{j'=1}^{J_{t-1}} \max_{k'=1}^{K} \left( g(b^{t-1}_{j'}, b^t_j) + a(k, k') + \delta^{t-1}_{j'k'} \right) \tag{10}$$

$$\max_{j'=1}^{J_{t-1}} \left( g(b^{t-1}_{j'}, b^t_j) + \max_{k'=1}^{K} \left( a(k, k') + \delta^{t-1}_{j'k'} \right) \right) \tag{11}$$

This is important because the computation of
$g(b^{t-1}_{j'}, b^t_j)$ might be expensive as it involves a
projection of $b^{t-1}_{j'}$ forward one frame (e.g. using
optical flow or KLT). Second, when applying this method to
multiple event models, the same factorization can be extended
to cache the computation of $g(b^{t-1}_{j'}, b^t_j)$ across
different event models as this term does not depend on the
event model.
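The equivalence of the nested and factored forms can be checked numerically. In the Python sketch below, the function names `nested_max` and `factored_max` and the array layout are assumptions made for illustration; `g[j_prev, j]` is treated as a precomputed table, which is also how it could be cached and shared across event models.

```python
import numpy as np

def nested_max(g, a, delta_prev):
    """Joint maximization over (j', k'), as in Eq. 10.

    g[j', j], a[k, k'], delta_prev[j', k'] -> result[j, k]
    """
    cand = g[:, None, :, None] + a.T[None, :, None, :] + delta_prev[:, :, None, None]
    return cand.max(axis=(0, 1))

def factored_max(g, a, delta_prev):
    """Factored form, as in Eq. 11: the g term is pulled out of the max over k'."""
    # inner[j', k] = max_{k'} (a[k, k'] + delta_prev[j', k'])
    inner = (a[None, :, :] + delta_prev[:, None, :]).max(axis=2)
    # result[j, k] = max_{j'} (g[j', j] + inner[j', k])
    return (g[:, :, None] + inner[:, None, :]).max(axis=0)
```

The factored form never combines $g$ with a particular previous state $k'$, so the coherency term is consulted once per pair of detections rather than once per pair of detections and pair of states.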



6 Combining object detection, tracking and 
event detection 

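One can combine the methods of Sections 4 and 5 to optimize
a cost function:

$$\max_{\substack{x_1, \ldots, x_T \\ y_1, \ldots, y_T \\ s_1, \ldots, s_T \\ k_1, \ldots, k_T}} \sum_{t=1}^{T} f(b^t_{x_t y_t s_t}) + \sum_{t=2}^{T} g(b^{t-1}_{x_{t-1} y_{t-1} s_{t-1}}, b^t_{x_t y_t s_t}) + \sum_{t=1}^{T} h(k_t, b^t_{x_t y_t s_t}) + \sum_{t=2}^{T} a(k_t, k_{t-1}) \tag{12}$$

that combines Eq. 6 with Eq. 8 by forming a large Viterbi
lattice with values $\delta^t_{xysk}$.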

One practical issue arises when applying the above method.
In Eq. 12, $h$ is a function of $b^t_{x_t y_t s_t}$, the
detection in the current frame. This allows the HMM event
model to depend on static object characteristics such as
position, shape, and pose. However, many approaches to event
recognition using HMMs use temporal derivatives of such
characteristics to provide object velocity and acceleration
information (Siskind and Morris, 1996; Starner et al., 1998).
Having $h$ also be a function of
$b^{t-1}_{x_{t-1} y_{t-1} s_{t-1}}$, the detection in the
previous frame, requires incorporating $h$ into the
generalized distance transform and thus restricts its form.



The above combination performs simultaneous object de- 
tection, tracking, and event classification, integrating infor- 
mation across all three. Without such information integra- 
tion, the object detector is unaffected by the tracker which 
is in turn unaffected by the event model. With such inte- 
gration, the event model can influence the tracker and both 
can influence the object detector. 

This is important because current object detectors cannot 
reliably detect small, deformable, or partially occluded ob- 
jects. Moreover, current trackers also fail to track such 
objects. Information from the event model can focus the 
object detector and tracker on those particular objects that 
participate in a specified event. An event model for rec- 
ognizing an agent picking an object up can bias the object 
detector and tracker to search for an object that exhibits a 
particular profile of motion relative to the agent, namely 
where the object is in close proximity to the agent, the ob- 
ject starts out being at rest while the agent approaches the 
object, then the agent touches the object, followed by the 
object moving with the agent. 

A traditional view of the relationship between object and 
event detection suggests that one recognizes a hammering 
event, in part, because one detects a hammer. Our unified 
approach inverts the traditional view, suggesting that one 
can recognize a hammer, in part, by detecting a hammering 
event. Furthermore, a strength of our approach is that such 
relationships are not encoded explicitly, do not have to be 
annotated in the training data for the event models, and are 
learned automatically as part of learning the parameters of 
the different event models. This is to say that the relation- 
ship between a person and the objects they manipulate can 
be learned from the co-occurrence of tracks in the training 
data, rather than from manually annotated symbolic rela- 
tionships. 

7 Experimental results 

Figure 3 demonstrates improved performance of simultaneous
object detection and tracking (c) over object detection (a)
and tracking (b) in isolation. This happens for different
reasons: motion blur, even for large objects, can lead
to poor detection results and hence poor tracks, small
objects are difficult to detect and track, and integration can
improve detection and tracking of deformable objects, such
as a person transitioning from an upright pose to sitting
down.

Figure 4 demonstrates improved performance of simultaneous
tracking and event recognition (c) over tracking (b)
in isolation. These results were obtained with object and
event models that were trained independently.² The object
models were trained on isolated frames using the standard
Felzenszwalb training software. The event models were
trained using tracks produced by the detection-based tracking
method described in Section 2. It is difficult to track the
person running with detection-based tracking alone due to
articulated appearance change and motion blur. Imposing
the prior of detecting running biases the tracker to find the
desired track.

² It would appear possible to co-train object and event
models by combining Baum-Welch (Baum, 1972; Baum et al., 1970)
with the training procedure for object models.

Figure 3: Improved performance of simultaneous object detection and tracking. (a) Output of the Felzenszwalb et al.
detector. (b) Tracks produced by detection-based tracking. (c) Tracks produced by simultaneous object detection and
tracking.

Figure 4: Improved performance of simultaneous tracking and event recognition. (a) Output of the Felzenszwalb et al.
detector. (b) Tracks produced by detection-based tracking. (c) Tracks produced by simultaneous tracking and event
recognition.

Figure 5: Improved performance of simultaneous object detection, tracking, and event recognition. (a) Output of the
Felzenszwalb et al. detector. (b) Tracks produced by detection-based tracking. (c) Tracks produced by simultaneous
object detection, tracking, and event recognition.

Figure 5 demonstrates improved performance of simultaneous
object detection, tracking, and event recognition (c)
over object detection (a) and tracking (b) in isolation. As
before, these results were obtained with object and event
models that were trained independently.



8 Conclusion

by introducing time into their cost functions, thus tracking
every possible detection in each frame. Furthermore, the
distance transform can be used to reduce the complexity
of doing so from quadratic to linear. The common internal
structure and algorithmic organization of object detection,
detection-based tracking, and event recognition further
allows an HMM-based approach to event recognition
to be incorporated into the general dynamic-programming
approach. This facilitates multidirectional information flow
where not only can object detection influence tracking and,
in turn, event recognition, but event recognition can also
influence tracking and, in turn, object detection.